@yanboshao yanboshao commented Jan 26, 2026

Motivation

If the kernel can obtain the output pointers of the other ranks, stage 2 can write data directly into the remote ranks' output buffers, saving one pass through local HBM.

Technical Details

In graph mode, each rank's output pointer is broadcast on the host side, so the kernel can address the remote outputs directly.
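The host-side exchange and the stage-2 direct write can be sketched in pure Python (function names and data layout here are hypothetical illustrations, not the PR's actual HIP kernel, which exchanges raw device pointers and writes from the GPU):

```python
# Minimal pure-Python model of the optimization. Host-side "broadcast":
# every rank publishes its output buffer once, before graph capture.
# Stage 2 then writes each reduced chunk directly into every remote
# output instead of staging it in local HBM and gathering afterwards.

def broadcast_output_ptrs(outputs):
    """Stand-in for the host-side broadcast: each rank learns every
    rank's output buffer (in the real kernel, a raw device pointer)."""
    return list(outputs)  # world-visible table of output buffers

def stage2_write(my_chunk, chunk_idx, ptr_table):
    """The rank owning one reduced chunk writes it directly into the
    output buffer of every rank, its own included; this saves the
    extra local-HBM write of a write-local-then-allgather scheme."""
    for out in ptr_table:
        out[chunk_idx] = my_chunk

world_size = 4
# One output buffer per rank, with one chunk slot per rank.
outputs = [[None] * world_size for _ in range(world_size)]
ptr_table = broadcast_output_ptrs(outputs)

# Pretend stage 1 reduced chunk r to the value r * 10 on rank r.
for rank in range(world_size):
    stage2_write(rank * 10, rank, ptr_table)

# Every rank's output now holds the full reduced result.
assert all(out == [0, 10, 20, 30] for out in outputs)
```

The key design point modeled here is that the pointer table is built once on the host (safe under graph capture, since the addresses are fixed before replay), while the per-iteration work is only the direct remote writes.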

Test Plan

Test Result

Dtype: bf16
Device: MI308 × 8
CudaGraph: on

| Shape | Old (μs) | New (μs) | Ratio (%) |
|---|---|---|---|
| 440x5120 | 44.85 | 36.92 | 17.68 |
| 512x5120 | 46.00 | 43.54 | 5.35 |
| 512x7168 | 62.32 | 54.62 | 12.36 |
| 512x8192 | 66.25 | 61.50 | 7.16 |
| 632x5120 | 53.12 | 47.91 | 9.81 |
| 680x5120 | 58.03 | 54.04 | 6.87 |

Dtype: bf16
Device: MI325 × 8
CudaGraph: on

| Shape | Old (μs) | New (μs) | Ratio (%) |
|---|---|---|---|
| 440x5120 | 33.40 | 32.00 | 4.18 |
| 512x5120 | 38.19 | 37.77 | 1.11 |
| 512x7168 | 53.33 | 51.41 | 3.61 |
| 512x8192 | 57.32 | 55.91 | 2.47 |
| 632x5120 | 45.45 | 43.91 | 3.38 |
| 680x5120 | 48.68 | 46.94 | 3.57 |

Dtype: bf16
Device: MI355 × 8
CudaGraph: on

| Shape | Old (μs) | New (μs) | Ratio (%) |
|---|---|---|---|
| 440x5120 | 24.91 | 24.98 | -0.25 |
| 512x5120 | 28.47 | 28.55 | -0.27 |
| 512x7168 | 38.49 | 38.71 | -0.57 |
| 512x8192 | 42.71 | 42.64 | 0.17 |
| 632x5120 | 34.12 | 33.95 | 0.50 |
| 680x5120 | 36.33 | 36.33 | 0.00 |

Submission Checklist

@yanboshao yanboshao requested a review from a team January 26, 2026 15:12
@yanboshao yanboshao changed the title optimize allreduce write mode by broadcast output addr optimize allreduce write mode by broadcast output ptr Jan 26, 2026

@valarLip valarLip left a comment

LGTM

@valarLip valarLip self-assigned this Jan 30, 2026
